Natural Language Processing: Automatic detection of topics of dissatisfaction in reviews

This topic modeling was carried out on reviews extracted from the Yelp database. The Yelp database contains 6 990 279 reviews, of which 156 067 are restaurant reviews. Of these restaurant reviews, 32 593 are negative (1 or 2 stars).

We will be carrying out the topic modeling on a sample of 5000 restaurant reviews from the 32 593 negative reviews filtered out from the original Yelp database.

1. Text preprocessing

Before topics can be extracted from the reviews (topic modeling), the review text has to be preprocessed. The pipeline chosen for this project is the following:

  1. Demojize the emojis, for example transform 🔥 into the word "fire"
  2. Transform each review into lowercase
  3. Separate each review into a list of separate words or tokens (using a regular-expression tokenizer)
  4. Remove all English stop words
  5. Lemmatize each token to strip inflections such as conjugations (using the WordNet lemmatizer from NLTK); lemmatization is preferred in this case because it yields more readable words
  6. Remove words shorter than 4 characters
  7. Keep only nouns
text clean_text
0 We went here for brunch on a Sunday with a lar... [brunch, sunday, party, celebrate, family, tim...
1 Slackssss once you get trapped in line you wil... [line, mile]
2 Must have been seated by the managers nephew. ... [manager, personality, dish, hole, want, stran...
3 Sub-par. Go to Zaik instead. [zaik]
4 A co-worker told me about this new opening spo... [worker, spot, milk, check, california, hype, ...
... ... ...
4995 Very disappointing. Customer service was poor.... [customer, service, attention, waiter, bill, k...
4996 I used to go her ln highschool with my girlfri... [highschool, taco, taco, chicken, taco, burrit...
4997 WARNING!!! DO NOT ORDER FROM THIS PLACE!!!! I... [order, place, hotel, room, yelp, star, review...
4998 If you like eggs in a carton this place is for... [carton, place, toast, side, mystery, hubby, o...
4999 The 2 is for the food. Service is 1 at best. ... [service, order, water, order, chip, salsa, ch...

5000 rows × 2 columns
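As a rough illustration of steps 2, 3, 4 and 6 of the pipeline, here is a minimal standard-library sketch. It is not the notebook's code: the function name `basic_clean` and the tiny stop-word set are assumptions made for the example, and demojizing, lemmatization and noun filtering are left out because they need external packages.

```python
import re

# Tiny illustrative stop-word set (assumption); the project uses a full
# English stop-word list instead.
STOP_WORDS = {"a", "an", "and", "for", "here", "in", "on", "the", "to", "was", "we", "with"}

# Keeps runs of letters only, so punctuation and digits are dropped.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def basic_clean(review, min_length=4):
    """Lowercase, tokenize, drop stop words, then drop short tokens."""
    tokens = TOKEN_PATTERN.findall(review.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


print(basic_clean("We went here for brunch on a Sunday with a larger party."))
# ['went', 'brunch', 'sunday', 'larger', 'party']
```

The words "went" and "larger" survive here but not in the table above, because the full pipeline also lemmatizes and keeps only nouns.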

1. A. Our corpus of reviews before cleaning

The most frequent words are mostly stop words. They carry no topical information, which shows why the text must be preprocessed before modeling the topics.

Number of tokens: 699680, Number of unique tokens: 20668
['We', 'went', 'here', 'for', 'brunch', 'on', 'a', 'Sunday', 'with', 'a', 'larger', 'party', 'to', 'celebrate', 'a', 'few', 'birthdays', 'in', 'my', 'family', ',', 'but', 'there', 'was', 'no', 'one', 'else', 'in', 'the', 'restaurant']
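The token statistics above can be reproduced with a simple counter. The sketch below runs on the first 17 tokens of the corpus only, so its numbers are illustrative and the variable names are assumptions.

```python
from collections import Counter

# First tokens of the raw corpus, as printed above.
tokens = ["We", "went", "here", "for", "brunch", "on", "a", "Sunday", "with",
          "a", "larger", "party", "to", "celebrate", "a", "few", "birthdays"]

counts = Counter(tokens)
n_tokens = len(tokens)        # total number of tokens
n_unique = len(counts)        # number of distinct tokens
most_frequent = counts.most_common(1)

print(f"Number of tokens: {n_tokens}, Number of unique tokens: {n_unique}")
print(most_frequent)
# Number of tokens: 17, Number of unique tokens: 15
# [('a', 3)]
```

Even on this short excerpt the most frequent token is a stop word.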

1. B. Our corpus of reviews after cleaning

There is a clear improvement in our corpus: it now contains mostly restaurant-related nouns, which our model can use to detect the topics.

Number of tokens: 98521, Number of unique tokens: 6939
['brunch' 'sunday' 'party' 'celebrate' 'family' 'time' 'reservation'
 'time' 'cake' 'dollar' 'person' 'cake' 'half' 'party' 'side' 'everyone'
 'plate' 'dish' 'salsa' 'mine' 'meal' 'waitress' 'chef' 'time' 'meal'
 'waitress' 'someone' 'meal' 'everyone' 'meal']

2. Training the LDA model

The Latent Dirichlet Allocation (LDA) model will allow us to detect topics in our reviews. A topic is a set of words that have a higher probability of appearing together in a review.

We first use a smaller sample of 1000 reviews to choose the number of topics, and then train our final model on all 5000 reviews to get the most reliable topics.

2. A. Number of topics for the model

To choose the number of topics, we compare the coherence score of models trained with different numbers of topics. Two coherence measures are available: u_mass, which is faster to compute but less accurate, and c_v, which is slower but more accurate. We computed c_v on a sample of 1000 reviews, which may explain the apparently lower than expected score. A larger sample, or the whole dataset of reviews, might yield a higher c_v score.
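To give an intuition of what a coherence score measures, here is a hand-written sketch of the u_mass idea: top words of a topic score higher when they occur in the same documents. It follows the usual pairwise definition, log((D(w_later, w_earlier) + 1) / D(w_earlier)) with D counting documents. The function name and the toy documents are assumptions, and topic modeling libraries may aggregate or normalize the pairs differently.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Sum of log((D(later, earlier) + 1) / D(earlier)) over word pairs.

    `top_words` must be ordered from most to least probable, and every word
    must occur in at least one document (otherwise D(earlier) is zero).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() keeps the input order, so `earlier` outranks `later`.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Toy corpus (assumption), already cleaned and tokenized.
docs = [["pizza", "sauce", "taste"], ["pizza", "sauce"],
        ["service", "wait"], ["service", "wait", "order"]]

coherent = umass_coherence(["pizza", "sauce"], docs)   # words co-occur
mixed = umass_coherence(["pizza", "service"], docs)    # words never co-occur
print(coherent > mixed)  # True
```

Repeating such a computation for models with 2, 3, 4, ... topics gives the kind of curve on which an elbow can be looked for.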

2. B. LDA Gensim with the final number of topics

We have decided on 4 topics, the area of the graph where we see a clear elbow, and we now train our final LDA model on 5000 reviews.

[(0,
  '0.039*"pizza" + 0.028*"chicken" + 0.025*"menu" + 0.023*"sauce" + 0.021*"taste" + 0.020*"meal" + 0.019*"price" + 0.019*"flavor" + 0.018*"meat" + 0.017*"dinner"'),
 (1,
  '0.110*"place" + 0.051*"restaurant" + 0.031*"thing" + 0.024*"review" + 0.022*"star" + 0.020*"year" + 0.019*"something" + 0.019*"night" + 0.018*"anything" + 0.015*"everything"'),
 (2,
  '0.105*"time" + 0.101*"order" + 0.079*"service" + 0.029*"experience" + 0.027*"location" + 0.027*"staff" + 0.022*"nothing" + 0.018*"drink" + 0.018*"server" + 0.018*"wait"'),
 (3,
  '0.041*"customer" + 0.039*"minute" + 0.029*"manager" + 0.024*"hour" + 0.017*"table" + 0.017*"business" + 0.016*"money" + 0.016*"employee" + 0.015*"someone" + 0.015*"room"')]
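Each topic above is printed as a weighted list of words, where the weight is the probability of the word within that topic. A small sketch for turning such a string into (word, weight) pairs, for example to sort or plot them; the helper name `parse_topic` is an assumption.

```python
import re

# First three terms of topic 0, copied from the output above.
topic_0 = '0.039*"pizza" + 0.028*"chicken" + 0.025*"menu"'


def parse_topic(topic_string):
    """Return (word, weight) pairs from a 'weight*"word" + ...' string."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


print(parse_topic(topic_0))
# [('pizza', 0.039), ('chicken', 0.028), ('menu', 0.025)]
```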

3. Conclusion

We can see four clear topics emerge (listed in the same order as topics 0 to 3 above):

  1. Dissatisfaction about the meal, the menu, the taste of the food (flavor, mediocre, many food items).
  2. Dissatisfaction about the restaurant, the quality (cost, dirty, look).
  3. Dissatisfaction about the experience, the wait, the order, the service.
  4. Dissatisfaction about the employees, the manager, the owner (attitude, problem, rude).

Computer Vision: Automatic image labeling

The data used for the automatic image labeling comes from the Yelp database, which contains 200 100 photos, with potential labels of: inside, outside, drink, food or menu.

In order to equally represent each label, we will use 100 images from each label in our sample (we used 200 in the Jupyter notebooks), so we will be working on 500 images total.

The objective of this exercise is to remove the original labels and to choose the most accurate method of automatically labeling the images.

Two methods were tested in order to keep the best-performing one:

  1. KMeans clustering of ORB (Oriented FAST and Rotated BRIEF) descriptors
  2. Transfer Learning using Convolutional Neural Networks (VGG16)

1. KMeans clustering of ORB descriptors

Oriented FAST and Rotated BRIEF, or ORB for short, is an open source alternative to similar algorithms such as SIFT and SURF, which were patented when ORB was introduced (the SIFT patent has since expired). It uses FAST to detect the image keypoints and then computes the BRIEF descriptors. (More information here: https://docs.opencv.org/3.4/d1/d89/tutorial_py_orb.html)

1. A. Example of image preprocessing and computing descriptors

Before detecting keypoints and computing descriptors, we need to preprocess our images, which in our case means equalizing their histograms. Let's take an image from our sample images and preprocess it:
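For readers unfamiliar with histogram equalization, the idea can be sketched in pure Python on a tiny grayscale image. This is only an illustration of the technique under the usual CDF-based formula, not the notebook's code, and the function name is an assumption; in practice an image library would do this step.

```python
def equalize_histogram(image, levels=256):
    """Equalize a grayscale image given as rows of ints in [0, levels)."""
    pixels = [p for row in image for p in row]

    # Histogram of gray levels, then its cumulative distribution (CDF).
    histogram = [0] * levels
    for p in pixels:
        histogram[p] += 1
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]

    # Remap gray levels so that the CDF becomes roughly linear.
    lookup = [round(max(c - cdf_min, 0) * (levels - 1) / (total - cdf_min))
              for c in cdf]
    return [[lookup[p] for p in row] for row in image]


# A low-contrast 2x2 image: all values squeezed between 100 and 103.
low_contrast = [[100, 101], [102, 103]]
print(equalize_histogram(low_contrast))
# [[0, 85], [170, 255]]
```

After equalization the four gray levels are spread over the full 0 to 255 range, which makes local structures easier to detect.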

We see that the histogram has been equalized. Let's now use ORB to compute 100 features and show them on the image:

1. B. Clustering our images with the ORB descriptors

Each keypoint comes with a descriptor array. These descriptors, stored as one row of values per image, are the data we will ask KMeans to analyze and separate into 5 clusters (our 5 labels).

Dimension of descriptors: 500 descriptors of length 2000
First descriptor: [  1 249  27 ...  35  28 117]

We will then use PCA to reduce the dimension of our descriptors while keeping enough components to explain 90% of the variance.

Dimension after PCA reduction :  (500, 352)
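The number of components kept (352 here) is the smallest number whose cumulative explained variance reaches the 90% target. The sketch below shows that selection rule on toy variances; the function name and the values are assumptions. scikit-learn's PCA applies this kind of rule itself when `n_components` is given as a float between 0 and 1, although we do not know whether the notebook relied on it.

```python
from itertools import accumulate


def components_for_variance(explained_variances, target=0.90):
    """Smallest number of leading components reaching `target` variance.

    `explained_variances` must be sorted in descending order.
    """
    total = sum(explained_variances)
    for k, cumulative in enumerate(accumulate(explained_variances), start=1):
        if cumulative / total >= target:
            return k
    return len(explained_variances)


# Toy per-component variances (assumption), largest first.
variances = [5.0, 3.0, 1.5, 0.3, 0.2]
print(components_for_variance(variances))
# 3  (cumulative shares: 50%, 80%, 95%)
```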

And finally we will use KMeans to cluster our descriptors into 5 clusters (representing food, drink, outside, inside, menu).

array([0, 2, 1, 2, 2, 3, 2, 0, 3, 4, 4, 1, 4, 1, 4, 0, 4, 0, 3, 0, 3, 2,
       1, 4, 3, 3, 3, 2, 3, 3, 2, 2, 4, 3, 4, 3, 0, 1, 1, 2, 4, 3, 3, 3,
       0, 3, 2, 2, 3, 0, 0, 1, 2, 3, 3, 0, 2, 4, 3, 0, 1, 1, 1, 0, 3, 3,
       4, 3, 3, 3, 3, 4, 1, 4, 2, 3, 1, 3, 2, 0, 3, 4, 3, 0, 4, 3, 0, 2,
       0, 4, 4, 0, 0, 3, 0, 4, 2, 3, 2, 3, 2, 0, 2, 3, 1, 2, 0, 4, 1, 0,
       0, 0, 2, 3, 4, 4, 1, 2, 3, 3, 3, 1, 0, 3, 3, 3, 4, 3, 3, 4, 3, 1,
       3, 0, 4, 3, 0, 3, 2, 3, 0, 0, 0, 2, 2, 3, 0, 2, 2, 1, 1, 3, 3, 0,
       3, 1, 4, 0, 3, 4, 4, 1, 3, 2, 2, 2, 4, 4, 1, 3, 4, 4, 2, 4, 1, 4,
       0, 4, 0, 0, 1, 4, 0, 3, 3, 2, 3, 0, 4, 4, 2, 4, 2, 2, 4, 4, 3, 0,
       1, 2, 0, 4, 1, 2, 0, 3, 3, 4, 4, 1, 2, 3, 1, 3, 4, 4, 2, 2, 4, 0,
       0, 2, 3, 1, 4, 0, 4, 4, 1, 4, 4, 0, 0, 4, 4, 3, 0, 0, 2, 0, 0, 2,
       4, 0, 0, 0, 0, 3, 2, 3, 4, 2, 3, 4, 2, 2, 2, 2, 3, 0, 2, 2, 3, 1,
       4, 4, 0, 3, 0, 4, 4, 4, 4, 1, 4, 4, 3, 2, 4, 0, 0, 3, 3, 2, 2, 3,
       0, 4, 0, 4, 4, 0, 0, 2, 0, 0, 4, 2, 0, 3, 2, 0, 4, 3, 3, 3, 2, 0,
       0, 3, 2, 0, 2, 4, 1, 0, 1, 3, 0, 0, 3, 2, 3, 4, 3, 2, 0, 1, 3, 4,
       3, 4, 3, 2, 3, 4, 2, 4, 0, 0, 2, 4, 1, 0, 0, 2, 4, 4, 4, 2, 4, 3,
       3, 3, 4, 4, 0, 3, 0, 0, 3, 1, 4, 4, 4, 3, 4, 3, 4, 4, 1, 3, 4, 3,
       1, 0, 1, 0, 3, 4, 4, 0, 4, 4, 2, 0, 0, 0, 4, 3, 4, 4, 0, 0, 4, 1,
       4, 1, 0, 4, 4, 3, 0, 4, 0, 3, 0, 0, 1, 2, 2, 3, 0, 4, 2, 4, 4, 2,
       0, 4, 2, 4, 0, 4, 4, 0, 2, 0, 0, 4, 3, 1, 4, 2, 0, 4, 4, 0, 3, 2,
       3, 0, 0, 2, 2, 1, 0, 4, 3, 0, 4, 2, 1, 4, 2, 0, 2, 2, 3, 0, 3, 4,
       0, 2, 2, 2, 0, 1, 1, 4, 1, 3, 2, 0, 0, 2, 2, 2, 3, 4, 2, 4, 3, 3,
       0, 2, 0, 2, 3, 4, 4, 4, 4, 1, 3, 4, 0, 4, 0, 1])
label    kmeans_label
drink    4               29
         0               27
         2               20
         3               17
         1                7
food     4               31
         0               24
         3               23
         2               12
         1               10
inside   3               33
         0               19
         2               18
         4               18
         1               12
menu     0               26
         4               26
         2               24
         3               15
         1                9
outside  3               26
         4               22
         0               20
         2               19
         1               13
Name: kmeans_label, dtype: int64

Comparing the true labels with the KMeans labels shows no noticeable pattern. We cannot say, for example, that KMeans label 1 is most often given to photos with a true label of "food".
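One way to put a number on this lack of pattern is to look, for each true label, at the share of its photos that fall into its most frequent cluster. The sketch below uses the counts from the table above; the variable names are assumptions.

```python
# Photo counts per true label and KMeans cluster, copied from the table above.
counts = {
    "drink":   {4: 29, 0: 27, 2: 20, 3: 17, 1: 7},
    "food":    {4: 31, 0: 24, 3: 23, 2: 12, 1: 10},
    "inside":  {3: 33, 0: 19, 2: 18, 4: 18, 1: 12},
    "menu":    {0: 26, 4: 26, 2: 24, 3: 15, 1: 9},
    "outside": {3: 26, 4: 22, 0: 20, 2: 19, 1: 13},
}

# Share of each label's photos that land in its most frequent cluster.
dominant_share = {
    label: max(clusters.values()) / sum(clusters.values())
    for label, clusters in counts.items()
}

for label, share in dominant_share.items():
    print(f"{label}: {share:.0%} of photos in the most frequent cluster")
```

No label has more than a third of its photos in a single cluster, whereas a useful clustering would put most photos of a label together.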

Let's visualize the data with t-SNE in two dimensions as a last attempt to find associations between the true and predicted labels.

ARI :  0.015019048620507871

1. C. Conclusion

The adjusted Rand index (ARI) of the true labels vs. the predicted labels is about 0.015. This is close to 0, the value expected from random assignments, and it means our clustering with ORB descriptors did not work as well as we hoped. In the next part, we will try to do something similar using a deep learning algorithm and hopefully get better results.
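To make the ARI less of a black box, here is a minimal pure-Python sketch of its standard pair-counting formula. The function name is an assumption and the notebook most likely used a library implementation. The score is 1 for identical partitions (whatever the cluster numbering), about 0 for random ones, and it can be negative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(labels_true)
    # Pairs of samples grouped together in both partitions, in the true
    # partition only, and in the predicted partition only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = pairs_true * pairs_pred / comb(n, 2)  # expected by chance
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Same grouping with swapped cluster ids: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```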

2. Transfer Learning with VGG16

A Convolutional Neural Network (CNN or ConvNet) is a type of neural network typically used in image recognition, image classification, object detection, face recognition, etc. We will be using a neural network which has convolutional layers for feature extraction and pooling layers for reducing the dimensions of our feature maps.

We will not be using the fully connected layers or the softmax output, because we are neither training the model nor predicting ImageNet classes at this point. Instead, we use the model pretrained on the labeled ImageNet database as a feature extractor. This is Transfer Learning: reusing a previously trained model on a new problem.

Here is our model, where as expected we do not have the last fully connected and softmax layers:

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_2 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0         
                                                                 
 block3_conv1 (Conv2D)       (None, 56, 56, 256)       295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 28, 28, 256)       0         
                                                                 
 block4_conv1 (Conv2D)       (None, 28, 28, 512)       1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 14, 14, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 7, 7, 512)         0         
                                                                 
=================================================================
Total params: 14,714,688
Trainable params: 0
Non-trainable params: 14,714,688
_________________________________________________________________
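The parameter counts and output shapes in the summary above can be checked by hand. Each convolution in VGG16 uses 3x3 kernels, so it has (3 * 3 * input channels + 1) * output channels parameters, and each of the 5 max pooling layers halves the spatial size. The short sketch below redoes this arithmetic; the variable names are assumptions.

```python
# (input channels, output channels) of the 13 convolutions, grouped by block.
BLOCKS = [
    [(3, 64), (64, 64)],
    [(64, 128), (128, 128)],
    [(128, 256), (256, 256), (256, 256)],
    [(256, 512), (512, 512), (512, 512)],
    [(512, 512), (512, 512), (512, 512)],
]


def conv_params(c_in, c_out, kernel=3):
    """Weights of a kernel x kernel convolution plus one bias per filter."""
    return (kernel * kernel * c_in + 1) * c_out


total_params = sum(conv_params(c_in, c_out) for block in BLOCKS for c_in, c_out in block)

side = 224
for _ in BLOCKS:  # each block ends with a 2x2 max pooling
    side //= 2

print(total_params)        # 14714688
print((side, side, 512))   # (7, 7, 512)
```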

2. A. Example of feature extraction with VGG16

The shape of the features extracted from the sample image is (7, 7, 512), which corresponds to the output of the last "max pooling" layer (block5_pool) in the model summary above.

1/1 [==============================] - 4s 4s/step
(1, 7, 7, 512)

2. B. Feature extraction on 500 photos with VGG16

ARI :  0.5632777907520027

2. C. Results

We achieve an ARI of 0.56, which is a good result considering the model was not trained on our data and we have only given it 500 photos to work with. Thanks to the t-SNE plots, we can see which true labels correspond to the KMeans predicted labels. We are going to map the cluster numbers to the label names in order to have a clear classification of our results.
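The mapping here was read off the t-SNE plots. As an alternative, it can also be derived automatically with a majority vote, sketched below on toy data. The function name and the data are assumptions, and a majority vote can assign the same name to two clusters, which should be checked before trusting the result.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(true_labels, cluster_ids):
    """Name each cluster after the most frequent true label inside it."""
    votes = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        votes[cluster][label] += 1
    return {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}


# Toy example (assumption): 6 photos, 3 clusters.
true_labels = ["food", "food", "drink", "drink", "menu", "menu"]
cluster_ids = [2, 2, 0, 2, 1, 1]

mapping = map_clusters_to_labels(true_labels, cluster_ids)
predicted = [mapping[c] for c in cluster_ids]
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

print(mapping)    # {2: 'food', 0: 'drink', 1: 'menu'}
print(predicted)  # ['food', 'food', 'drink', 'food', 'menu', 'menu']
```

Once the clusters are named, the predictions can be compared with the true labels as in any classification problem.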

Classification Report
              precision    recall  f1-score   support

       drink       0.80      0.84      0.82       100
        food       0.95      0.81      0.88       100
      inside       0.73      0.64      0.68       100
        menu       0.85      0.92      0.88       100
     outside       0.63      0.72      0.67       100

    accuracy                           0.79       500
   macro avg       0.79      0.79      0.79       500
weighted avg       0.79      0.79      0.79       500

******************************************************
Confusion Matrix

We have an overall accuracy of 79%. The label predicted with the highest precision is "food" (0.95), whilst the lowest is "outside" (0.63). The labels most often confused are "inside" and "outside", as well as "food" and "drink".

2. D. Examples of mislabeled photos

"Inside" predicted as "Outside": 26 occurrences

These mismatches are probably caused by the lighting in the photos. The second and fourth photos have a colder lighting that is similar to natural light. The first and third photos also contain framed pictures which might be misinterpreted by the model, especially the third photo, which has a framed picture of the sky with the sun.

"Outside" predicted as "Inside": 22 occurrences

The first photo was probably mislabeled from the beginning, as it actually shows an interior. The second photo was taken under some sort of roof which might be misinterpreted as a ceiling. In the fourth photo, most of the top area is also covered by an umbrella and a roof which might be misinterpreted as a ceiling.

"Food" predicted as "Drink": 12 occurrences

The first and third photos both show a round bowl of sauce which might be misinterpreted as a drink. The second has a similar transparent plastic cup, which might be misinterpreted because it resembles a small drinking cup. The last photo might have been mislabeled due to the presence of a lime, which is often found on drinks.

"Drink" predicted as "Outside": 8 occurrences

The first photo has very natural lighting which might be causing the mislabeling, and the second photo seems to have been taken outside. The items in the photos are not very identifiable, so it isn't surprising that the model is having some trouble.

3. Conclusion

In summary, the features extracted through transfer learning with VGG16 performed much better than the features extracted with ORB. If we were to carry out supervised learning with the VGG16 model and test it on thousands of photos rather than just 500, we would likely reach a very satisfactory precision.